Purpose

Find Major Diagnostic Categories (MDCs) associated with Medicaid and/or Private Insurance payer types.

Analyzing MDC codes from all admissions in HCUP NY SID 2006-2012.

Plots

Scatterplot

Ordered by ratio of Medicaid to Private Insurance admissions

Ordered by proportion of Medicaid admissions

Ordered by counts of Medicaid admissions

K-means

Background & Algorithm

K-means clustering classifies MDCs into k groups such that MDCs within the same cluster are as similar as possible, and MDCs from different clusters are as dissimilar as possible. For our data, similarity is represented by the number of discharges/admissions from each payer type.

K-means defines clusters by trying to minimize the total within-cluster variation. The standard algorithm (Hartigan and Wong, 1979) defines the within-cluster variation as the sum of squared Euclidean distances between each MDC and its corresponding cluster centroid:

\[W(C_k) = \sum_{x_i \in C_k} (x_i - \mu_k)^2\]

where:

  • \(x_i\) is an MDC belonging to cluster \(C_k\)
  • \(\mu_k\) is the mean value of the MDCs assigned to cluster \(C_k\). This is a vector of the means of all discharges by payer type for all MDCs in the cluster.

The algorithm tries to minimize the total within-cluster variation, summed over all k clusters:

\[Total.Within.SS = \sum_{j=1}^{k} W(C_j) = \sum_{j=1}^{k} \sum_{x_i \in C_j} (x_i - \mu_j)^2\]
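As a minimal sketch of the two formulas above, the following Python computes the within-cluster variation for each cluster and sums it into the total. The function names and the (Medicaid, Private Insurance) counts are hypothetical toy values chosen for easy arithmetic; they are not taken from the HCUP data.

```python
def centroid_of(points):
    """Mean vector of a cluster: one mean per payer type."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def within_ss(points, centroid):
    """W(C_k): sum of squared Euclidean distances to the centroid."""
    return sum(
        sum((x - m) ** 2 for x, m in zip(point, centroid))
        for point in points
    )


def total_within_ss(clusters):
    """Total.Within.SS: W(C_k) summed over every cluster."""
    return sum(within_ss(c, centroid_of(c)) for c in clusters)


# Toy (Medicaid, Private Insurance) discharge counts, made up for illustration.
clusters = [
    [(10.0, 40.0), (12.0, 44.0)],
    [(90.0, 20.0), (94.0, 22.0), (92.0, 24.0)],
]

print(total_within_ss(clusters))  # 10.0 from the first cluster + 16.0 from the second
```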

The k-means algorithm can be summarized as:

  1. Specify the number of clusters (k).
  2. Randomly select k MDCs from the data as the initial cluster centroids (means).
  3. Assign each MDC to its closest centroid, based on Euclidean distance.
  4. For each of the k clusters, update the cluster centroid by recalculating the mean values of all MDCs in the cluster.
  5. Iteratively minimize the total within-cluster sum of squares. That is, repeat steps 3 and 4 until cluster assignments stop changing or a user-specified maximum number of iterations is reached.
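The five steps above can be sketched in plain Python as follows. This is an illustrative assign-and-update loop that mirrors the summary, not the Hartigan-Wong implementation itself; the function names, the `seed` and `max_iter` parameters, and the toy (Medicaid, Private Insurance) counts are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 3: assign each point to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 5: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 4: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignments, centroids


# Toy (Medicaid, Private Insurance) counts, made up for illustration.
points = [
    (10.0, 40.0), (12.0, 44.0), (11.0, 42.0),
    (90.0, 20.0), (94.0, 22.0), (92.0, 24.0),
]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Because the starting centroids are chosen at random, the numeric cluster labels can differ between seeds even when the grouping itself is the same.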

Cluster Results

Implemented k-means clustering for \(k = 2, \dots, 15\). Clusters are visualized for \(k = 2, \dots, 6\).

Determining Optimal Clusters

Recall that k-means defines clusters by minimizing the total within-cluster variation (Total.Within.SS). We can plot the Total.Within.SS against the number of clusters k to decide the optimal number of clusters.

As k increases, the Total.Within.SS approaches 0. Researchers generally use the “elbow method”: choose the value of k where the line bends, since adding clusters beyond that point yields diminishing returns in reducing the Total.Within.SS.
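The elbow is usually read off the plot by eye. As a rough sketch of one way to approximate it numerically, the Python below picks the k where the drop in Total.Within.SS shrinks the most from one step to the next. This heuristic, the function name `elbow_k`, and the toy values are assumptions for illustration only; the values are not results from this analysis, and the sketch assumes consecutive values of k.

```python
def elbow_k(wss_by_k):
    """Return the k with the sharpest bend in the Total.Within.SS curve.

    wss_by_k maps consecutive values of k to their Total.Within.SS.
    The bend at k is the drop going into k minus the drop going out of k.
    """
    ks = sorted(wss_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = wss_by_k[prev_k] - wss_by_k[k]
        drop_after = wss_by_k[k] - wss_by_k[next_k]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Toy Total.Within.SS values, made up for illustration.
toy_wss = {1: 1000.0, 2: 600.0, 3: 200.0, 4: 180.0, 5: 170.0}
print(elbow_k(toy_wss))  # 3
```

A numeric rule like this is best treated as a cross-check on the plot, since real curves often bend gradually and have no single clear elbow.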